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In addition to single-nucleotide polymorphisms (SNP), copy number variation (CNV) is a major component of human genetic 
diversity. Among many whole-genome analysis platforms, SNP arrays have been commonly used for genomewide CNV 
discovery. Recently, a number of CNV defining algorithms from SNP genotyping data have been developed; however, due to 
the fundamental limitation of SNP genotyping data for the measurement of signal intensity, there are still concerns regarding 
the possibility of false discovery or low sensitivity for detecting CNVs. In this study, we aimed to verify the effect of combining 
multiple CNV calling algorithms and set up the most reliable pipeline for CNV calling with Affymetrix Genomewide SNP 5.0 
data. For this purpose, we selected the 3 most commonly used algorithms for CNV segmentation from SNP genotyping data, 
PennCNV, QuantiSNP; and BirdSuite. After defining the CNV loci using the 3 different algorithms, we assessed how many of 
them overlapped with each other, and we also validated the CNVs by genomic quantitative PGR. Through this analysis, we 
proposed that for reliable CNV-based genomewide association study using SNP array data, CNV calls must be performed with 
at least 3 different algorithms and that the CNVs consistently called from more than 2 algorithms must be used for 
association analysis, because they are more reliable than the CNVs called from a single algorithm. Our result will be helpful 
to set up the CNV analysis protocols for Affymetrix Genomewide SNP 5.0 genotyping data. 

Keywords: CNV defining algorithm, DNA copy number variations, SNP array 



Introduction 

Human genome variation has facilitated the under- 
standing of inter-individual phenotypic differences [1, 2]. In 
addition to single-nucleotide polymorphisms (SNPs), it is 
widely accepted that large-scale DNA structural variation, 
termed copy number variation (CNV), is a major component 
of human genetic diversity [2, 3]. Genomewide SNP geno- 
typing data can be used for CNV calling [4]; therefore, if we 
can get reliable CNV calls firom the SNP genotyping data, a 
CNV-based genomewide association study (GWAS) can be 
realized. Among many whole-genome CNV analysis plat- 
forms, SNP arrays have been suggested as a resource for 
CNV discovery due to their ubiquitous genome coverage and 
relatively advantageous resolution. However, despite the 
importance of CNV-disease association analysis, CNV 



calling from SNP genotyping data has not been well 
established. Affymetrix Genomewide SNP 5.0 is one of the 
commonly used SNP array platforms for SNP-GWAS as well 
as CNV analysis [5]. We previously validated the accuracy 
and reproducibility of CNVs called firom Afifymetrix SNP 
array 5.0 data by comparing the CNV calls firom 3 different 
array platforms using NEXUS software: Affymetrix SNP 
array 5.0, Agilent 2X244K CNV array, and NimbleGen 2. IM 
CNV array [6]. 

Recently, a number of CNV defining algorithms have been 
developed, which have facilitated the CNV-based GWAS 
[7-14]. However, due to the fundamental limitation of SNP 
genotyping data for the measurement of signal intensity, 
there are still concems regarding the possibility of false 
discovery or low sensitivity for detecting CNVs [15, 16]. 
Indeed, CNV calling is dependent on the types of array 
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platforms and analytic tools. Each platform and calling 
algorithm has its own advantages and disadvantages; so, one 
single algorithm or array platform is not always best for 
determination of CNVs [17-19]. Recently Pinto et al. [20] 
showed that different analytic tools applied to the same raw 
data typically yielded CNV calls with < 50% concordance 
and, using multiple algorithms, minimize the number of 
false discoveries. To remedy the potential limitations of SNP 
array for CNV detection, more than one way of CNV calling 
by using several different segmentation algorithms are 
performed, and overlapped calls are used for GWAS analysis 
[21-24]. 

In this study, we tried to verify the effect of adopting mul- 
tiple CNV calling algorithms and set up the most reliable 
pipeline for CNV calling with Affymetrix Genomewide SNP 
5.0 data. We selected the 3 most commonly used algorithms 
for CNV segmentation from SNP genotyping data, PennCNV 
QuantiSNI^ and BirdSuite. After defining the CNV loci using 
the 3 different algorithms, we assessed how many of them 
overlapped with each other, and we also validated the CNVs 
by genomic quantitative PGR (qPCR) . Finally we concluded 
that CNVs that were consistently called from more than 2 
different calling algorithms are more reliable than the CNVs 
called from a single algorithm. Our result will be helpful to 
set up the CNV analysis protocols for Affymetrix Geno- 
mewide SNP 5.0 genotyping data. 

Methods 
Study materials 

We used Affymetrix Genomewide SNP 5.0 genotyping 

data provided by the Korea Association Resource (KARE) 
consortium, Korean Genome Epidemiology Study (KoGES) . 
As a control, we purchased a HapMap cell line, GM10851, 
from Coriell Institute for Medical Research (Camden, NJ, 
USA) and extracted genomic DNA using the DNeasy Blood 
& Tissue Kit (Qiagen, Hilden, Germany). 

Pre-processing SNP array data 

Before CNV calling procedures, all required pre-pro- 
cessing procedures, including allele correction, summa- 
rization, and background correction, were performed as 
described previously by using the software provided by 
Affymetrix [5]. In brief background-corrected data were 
normalized using quantile normalization and summarized 
by median polish. 

CNV calling 

For this study, Affymetrix 5.0 SNP array (Affymetrix, 

Santa Clara, CA, USA) data of 10 subjects were randomly 
selected from the KARE dataset for CNV calling. Affymetrix 



5.0 data of NA10851 was used as a control. For defining 
CNVs, we choose 3 different segmentation algorithms, 
PennCNV [25], QuantiSNP [26] and BirdSuite [27]. 
PennCNV implements a hidden Markov model (HMM) and 
considers SNP allelic ratio distribution in addition to signal 
intensity. We only used CNVs from PennCNV with samples 
that had a standard deviation of log R ratio (LRR) smaller 
than 0.2, drift values of b-allele frequency (BAF_drift) smal- 
ler than 0.01, and waviness factor between —0.05 and 0.05. 
For QuantiSNI^ we did not apply any criteria for sample 
filtering, since QiaantiSNP uses an Objective Bayes Hidden- 
Markov Model (OB-HMM) for calibrating the model for 
fixing the false positive error rate and maximum marginal 
likelihood to set other hyperparameters. It was originally 
developed for detecting CNVs from BeadArray SNP geno- 
typing data of Ulumina, but Affymetrix data also could be 
processed as long as they have probe signal intensities and 
B-allele frequencies. Birdsuite uses exclusive copy number 
analysis routine (Canary) for CNP locus and HMM for 
finding rare CNVs. It generates a logarithm of the odds ratio 
(LOD) value for each CNV that describes the likelihood of a 
CNV relative to no CNV over a given interval, including 
flanking sequences. Of the BirdSuite calls, only the CNVs 
that have an LOD value smaller than 2 were used for fiirther 
analysis. All 3 methods were used with default values, and 
no other option was added or changed. 

Validation 

To estimate the false discovery rate of our CNV-calling 
algorithm, CNVs were validated by genomic real-time qPCR. 
For this purpose, we randomly selected 25 CNV loci and 
performed genomic qPCR using DNAs from study subjects 
who showed corresponding CNVs on that lod. The PGR 
primers used in this study were designed using the 
PrimerQuest program (http://www.idtdna.com/Scitools/ 
Applications/Primerquest). To verify the specificity of the 
PGR reactions under the unified denaturation temperature 
(60°C), we performed PGR and agarose gel electrophoresis 
for each primer set. We also screened the University of 
California Santa Cruz (UCSC) database (http://genome. 
ucsc.edu/) to confirm the unique sequence without any 
repeat sequences in the primers. Sequence information of 
the primers for genomic qPCR validation is listed in Table 1 . 

Ten microliters of the reaction mixture contained 10 ng of 
genomic DNA, SYBR Premix Ex Taq TM II (TaKaRaBio, 
Shiga, Japan), 1 x ROX (Toyobo, Osaka, Japan), and 10 pmol 
of primers. Thermal cycling conditions consisted of 1 cycle of 
10 s at 95°C, followed by 40 cycles of 5 s at 95°C, 10 s at 61°C, 
and 20 s at 72°C. All PGR experiments were repeated twice, 
and amplification efficiencies for both target and reference 
genes were evaluated using a standard curve over 1 : 5 serial 
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Table 1. Sequence information of the primers for genomic qPCR validation 



Chr 


Start 


End 


Length (bp) 


Forward 


Reverse 


1 


108,538,343 


108,538,461 


119 


AGGAGGTTGCACCATGGTTAGTCA 


G G CCACAG CACATCTTGTG AAACA 


1 


150,842,026 


150,842,185 


160 


AACCATGGACTTCGTGGGTAGTCA 


CATC CTCC ATG CATTGTG G TG G AA 


3 


259,837 


259,952 


116 


CAAATGGAACAGCAGGGTCAGCAA 


TGCTGTGTCCAGCATCCTATGTGT 


4 


69,073,047 


69,073,213 


167 


TTGTTGGAGGAACAAAGCCCAACC 


TGGCTGGTGTCTGTTCTGATTGGT 


5 


92,610,728 


92,610,891 


164 


AGGTTGATGAGCCACACAGGGTAT 


TGCTCCTGAATTCCTCAGCTTCCA 


5 


178,045,652 


1 78,045,850 


199 


AGGCAAGAGGTAGCCCACCTTAAT 


AAAGCAGGAGCTGAGAGGCAGAAA 


7 


141,704,582 


141,704,737 


156 


AAACAGACAGGCACTGGTCCATCT 


ATGGCATAACCTCCATCCCACTCA 


7 


141,711,867 


141,712,039 


173 


TGAGACTGTGGATCTTTGGCCACT 


TAATTCCACATGTCCAG G CCCACT 


7 


154,025,113 


154,025,228 


116 


TGCAATGGCACGATCTTGTCTCAC 


AGGCATGATGGTGGGTGCCTATAA 


11 


55,162,545 


55,162,631 


87 


TCTATCACGTGCACCCAGCTCATT 


TGTGGATGTGTAGCAAAGGTCGGA 


11 


55,207,777 


55,207,961 


185 


TGCACTACACCATCATCACGACCA 


ATCAATGCGAGCCAACTTCAGCAG 


14 


81,569,884 


81,569,974 


91 


TGCATGTTAGGAGGCTGTGGATCA 


TGAGGCAGAACAATGTGGCCTCTA 


14 


81,569,895 


81,569,974 


80 


AGG CTGTG G ATCAATACG G GTTCA 


TGAGGCAGAACAATGTGGCCTCTA 


16 


54,363,707 


54,363,878 


172 


AGCCGCATCTGTAGTCCTGAAAGT 


GTTCCCTCCAAAGCTGGCAATGTT 


5 


814,195 


814,318 


124 


ACCTCGGCCGGATTCTGGATTAAA 


ACCTTCATGGCAGGTGAGAAGACA 


9 


44,685,893 


44,685,973 


81 


ATGACAGACAGGACCCAACACCAT 


TCAACAATAGG G GAG AG G AAG CCA 


15 


19,145,532 


19,145,684 


153 


GGCGCAGTGGTTCATGCTTGTAAT 


TGCGTACCACCATGCCTAGCTAAA 


15 


19,833,989 


19,834,167 


179 


TGCCTAAGCTGTGTTACTCTGCCA 


CGCAAAGGGTTACAGATGGCAACA 


20 


1,531,337 


1,531,532 


196 


ATCACCCAATTGCGGACTCCTCTT 


ACAGACTCTACGGCGTTGGCTTTA 


22 


22,702,643 


22,702,799 


157 


GCCTGACTTCGAAATGGTGGCAAA 


TGGTTGCCTGGTTTCTAGCCCTAT 


2 


87,557,596 


87,557,722 


127 


TGAGAGGCAGGTGGATTTGGATGT 


TGGAAGACACACAGCGAACCTCTT 


4 


70,207,254 


70,207,437 


184 


ACCTCAAATTTCAGTGCCGAAGGC 


TCTTCTGTGCTGGCTGTGGATTCT 


15 


22,237,790 


22,237,886 


97 


AGCTCAGGAGATGAAAGGGCACAT 


TCTGCCTGAAGCAAGTGTACCTGT 


15 


19,843,004 


19,843,164 


161 


AGACTTGCCTTCTTGTCAGCCTCA 


AGACAGGGCAGGAAGAACTTTCCA 


19 


48,112,193 


48,112,355 


163 


TCCTGCATCCTCCTGTGTGACATT 


TGGCTACATCTGGTACAAAGGGCA 



qPCR, quantitative PGR; Chr, chromosome. 



Table 2. General characteristics of the copy number variation (CNV) calls from 3 algorithms 







No lillralion 






> 7 conscrulivo proves 






PennCNV 


QuantiSNP 


BirdSuite 


PennCNV 


QuantiSNP 


BirdSuite 


Total no. of calls 
Size Average 
Median 

Small CNV <1 kbp 


814 
44,672.6 
12,981 
49 (6.0%) 


1,087 
39,911 
7,599 
272 (25.0%) 


1,362 
15,466.5 
7,964 
193 (14.2%) 


295 
104,264 
50,799 
0 


311 
103,782 
50,597 
0 


385 
34,463.9 
23,528 
0 



dilutions. The copy number of each target was defined as 2- 
^ ^CT, where dCt is the difference in threshold cycles for 
the sample in question normalized against the reference 
gene (RNaseP) and expressed relative to the value obtained 
by calibrator DNA (NA10851 and Promega DNA), as 
desaibed elsewhere [28]. 

Results and Discussion 

The general characteristics of the CNV calls from the 3 
algorithms are shown in Table 2. The total number of CNV 

calls fi-om the 3 algorithms was similar-range 814 to 1,362. 
The median size of the CNVs called by PennCNV (12.9 kb) 
were bigger than the other 2 algorithms (7.5 kb and 7.9 kb). 



The size of the CNVs consistently detected by 2 or more 
algorithms was generally bigger than that of the CNVs 
detected by a single method. For example, the average sizes 
of the CNVs that were detected fi-om all 3 algorithms, 2 
algorithms, and 1 method were 107 kb, 73.2 kb, and 15.9 kb, 
respectively (Table 3). Fig. 1 shows the examples of different 
sizes of CNVs according to the LRR distribution of gain- and 
loss-type CNV regions, depending on how many algorithms 
detected the CNV in that region. These data suggest that 
larger-sized CNVs are generally more prominent; so, they 
can be relatively easily detected by any segmentation 
method, regardless of the algorithm. But, detection of the 
smaller-sized CNVs can be affected more easily by the 
characteristics of the CNV defining algorithm; so, they can 
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Table 3. Number of copy number variations (CNVs) consistently defined from 3 different algorithms 









No filtration 






>7 


consecutive 


proves 






Size 


CNV 


PennCNV QuantiSNP 


BirdSuite 


Size 


CNV 


PennCNV 


QuantiSNP 


BirdSuite 


All overlap 


107.2 


40 


40 42 


40 


208 


14 


14 


15 


15 


2 overlap 


73.2 


425 


356 424 


91 


130 


231 


212 


223 


14 


Unique 


15.9 


2,270 


418 621 


1,231 


40 


498 


69 


73 


356 



Size: average size (kb). 



(A) 



■B 0 

ro 

-2 
-3 



(B) 



(C) 



Chr4 

70.10 70.20 70.30 

Genomic position (Mb) 



1 

■a 0 

CD 

-2 
-3 



3^ 



Chr15 

19.75 19.85 19.95 

Genomic position (Mb) 



S 0 
m 

cn 

° -1 
-2 
-3 



(D) 



(E) 



1.42 1.52 1.62 

Genomic position (Mb) 



0 

TO 

1-1 

-2 
-3 



(F) 



■B 0 
ro 

cn 

° -1 
-2 
-3 

2 
1 

■S 0 

CD 

1-1 

-2 
-3 



22.60 22.70 22.80 

Genomic position (Mb) 



108.40 108.50 108.60 
Genomic position (Mb) 



54.25 54.35 54.45 

Genomic position (Mb) 



Fig. 1. Signal intensity (Log R ratio) plots of the copy number 
variation (CNV) regions. (A, B) CNVs detected from all 3 
algorithms. (C, D) CNVs detected from 2 of the 3 algorithms. (E, 
F) CNVs uniquely detected by a single algorithm. Green bars, copy 
number-loss CNV regions; red bars, copy number-gained CNV 
regions. 



be detected by 1 particular algorithm. 

A recent report examining the impact of inaccuracy of 
CNV detection from SNP genotyping data revealed that 
CNVs, defined as the copy number changes of >7 con- 
secutive probes, were fairly reliable in case of deletion type 
CNVs [29] . In our previous study exploring the CNV profiles 
of Koreans using an Affymetrix array, we suggested the 
filtering condition of CNV call as > 6 consecutive probes to 
be reliable [5] . Therefore, to make more reliable conditions 
in this study, we filtered the CNV calls as > 7 consecutive 



probes. Under this criterion, the number of CNV calls from 
the 3 algorithms was reduced down from 295 to 385, while 
the median size became bigger-50 kb by PennCNV and 
QuantiSNP and 23.5 kb by BirdSuite (Table 2). In this 
condition, same as above, the size of the CNVs consistently 
detected by 2 or more algorithms was generally bigger than 
that of the CNVs detected by a single method (Table 3). 

In terms of the number of CNVs consistently identified by 
different algorithms, of the CNV calls using the PennCNV 
algorithm, 48.6% (396/814 CNVs) was detected by 2 or 
more algorithms, and only 4.9% (40/814 CNVs) was 
detected by all 3 algorithms. Similarly, of the CNV calls using 
QuantiSNi; 42.9% (466/1,087 CNVs) was detected by 2 or 
more algorithms and only 3.9% (42/1,087 CNVs) was 
detected by all 3 algorithms. However, in the case of the 
BirdSuite algorithm, although the number of CNV calls was 
the biggest, only 9.6% (131/1,362 CNVs) was detected by 2 
or more algorithms and 2.9% (40/1,362 CNVs) was detected 
by all 3 algorithms. Taken together, only 17% of the total 
CNV calls (465/2,735 CNVs) were defined by more than 2 
algorithms, and only 1.5% (40/2,735 CNVs) was defined by 
all 3 algorithms (Table 3) . In the filtering condition of > 7 
consecutive probes, the number of CNVs consistently 
identified by different algorithms was improved. For 
example, of the CNV calls using the PennCNV algorithm, 
76.6% (226/295 CNVs) was detected by 2 or more 
algorithms and only 23.4% (69/295 CNVs) was detected 
uniquely by PennCNV QuantiSNP calls showed a similar 
trend. However, even under this condition, 92.5% (356/385 
CNVs) of BirdSuite calls were unique (Table 3). All the 
CNVs consistently detected by 2 algorithms were from 
PennCNV and QuantiSNP, while no CNVs that were 
detected by the BirdSuite algorithm were consistently 
detected by 2 algorithms, with only 1 exception. These data 
suggest that CNV calls from SNP genotyping depend 
substantially on the characteristics of calling algorithms, and 
only part of the CNV calls from each algorithm seems to be 
reliable. Therefore, CNVs identified from SNP genotyping 
data must be validated as completely as possible, and 
especially in the case of using a single calling algorithm, the 
CNVs must be validated more carefully. Filtering the CNV 
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Table 4. Genomic qPCR validation of the CNV calls from SNP 
genotyping data 



Subject 
no. 


Unique 


2 overlaps 


3 overlaps 


CNV 

call 


qPCR 


CNV 
call 


qPCR 


CNV 
call 


qPCR 


No filtration 












1 


9 


3 


2 


1 


0 


0 


2 


0 


0 


3 


2 


1 


1 


3 


4 


1 


4 


4 


0 


0 


4 


7 


3 


2 


2 


3 


2 


5 


4 


1 


4 


3 


0 


0 


6 


3 


3 


5 


4 


0 


0 


7 


4 


2 


3 


2 


1 


1 


8 


6 


3 


5 


1 


1 


1 


9 


4 


1 


5 


0 


0 


0 


10 


6 


1 


0 


0 


1 


0 




47 


18 

(38.3%) 


33 


19 
(57.6%) 


7 


5 

(71.4%) 


>7 consecutive 


proves 










1 


6 


3 


1 


1 


0 


0 


2 


0 


0 


3 


2 


1 


1 


3 


1 


1 


4 


4 


0 


0 


4 


3 


2 


1 


1 


3 


2 


5 


2 


0 


3 


3 


0 


0 


6 


0 


0 


5 


4 


0 


0 


7 


1 


1 


3 


2 


1 


1 


8 


2 


2 


5 


1 


1 


1 


9 


3 


1 


4 


0 


0 


0 


10 


5 


1 


0 


0 


0 


0 




23 


11 
(47.8%) 


29 


18 
(62.1%) 


6 


5 

(83.3%) 



qPCR, number of consistently detected CNVs by quantitative 
PCR; CNV call, number of copy number variation calls; SNP, 
single-nucleotide polymorphism. 



calls by the number of consecutive probes can improve the 

reliability of CNV calls. 

To validate the CNVs identified firom 3 different algori- 
thms in this study, we randomly selected 25 CNV regions 
across the whole chromosomes. The genomic qPCR vali- 
dation results are listed in Table 4. In case of the CNVs 
identified by all 3 algorithms, 71.4% (5/7) of the consistency 
was observed between the CNV call and genomic qPCR 
result. Of the CNVs identified by more than 2 algorithms, 
57.6% (19/33) was consistent. Of the unique CNV calls 
defined by just a single algorithm, only 38.3% (18/47) was 
consistent, but the other 61.7% (28/46) was not consistent. 
In the case of the CNVs defined as > 7 consecutive probes, 
the consistency was generally improved-47.8%) (unique 
CNVs), 62.2% (CNVs identified by more tiian 2 algorithms), 
and 83.3% (CNVs identified by all 3 algorithms). These 
results indicate that CNV calls from 2 or more algorithms are 
more reliable than those from single algorithms. CNV calls 



from all 3 algorithms are, of course, the most reliable, but 
this is too stringent; so, the number of CNVs is not 
applicable for GWAS analysis. 

In this study, we aimed to verify the effect of adopting 
multiple CNV calling algorithms and set up the most reliable 
pipeline for CNV calling with Affymetrix Genomewide SNP 
5.0 data. We found that CNVs defined by a single CNV 
calling algorithm may not reliable enough for further GWAS 
study, regardless of the types of algorithms. Based on our 
findings, we propose that for reliable CNV-based GWAS 
using SNP array data, CNV calls must be performed with at 
least 3 different algorithms, and the CNVs consistently 
called firom more than 2 methods must be used for asso- 
ciation analysis. 
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